This ipython file is the project by Hongyi Tang and Weijian Li for course 12752. There are four ipython files in the project in total. Each file consist of one cluster analysis task. In this file, the cluster analysis is demonstrated to 2 building types.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle

%matplotlib inline

Please download the dataset and change the file path.


In [2]:
# read in data from Commercial Building Energy Consumption Survey (CBECS)
data = pd.DataFrame.from_csv('C:/F16-12-752-master/projects/thongyi_weijian1/data/CBECS.csv') 
data.tail()


Out[2]:
REGION CENDIV PBA FREESTN SQFT SQFTC WLCNS RFCNS RFCOOL RFTILT ... FKCLBTU FKWTBTU FKCKBTU FKOTBTU DHHTBTU DHCLBTU DHWTBTU DHCKBTU DHOTBTU PUBCLIM
PUBID
6716 3 5 14 1.0 108000 7 1 6 2 1 ... 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN 2
6717 3 7 5 1.0 1700 2 5 5 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
6718 2 3 26 1.0 2000 2 1 4 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
6719 1 2 12 1.0 19250 4 1 4 2 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
6720 3 5 14 1.0 142000 7 1 1 1 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2

5 rows × 1118 columns

In this case, only two building types which is office and inpatient health care is selected. And four energy consumptions elements are always the same for all four ipython files.


In [9]:
energydata=pd.DataFrame()

type_B=[2,16] # Office and inpatient health care are selected to demonstrated cluster analysis.
type_C=[1,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,91]

data_type=data
data_type=data_type[data_type.NGUSED!=2]
for i in type_C:
    data_type=data_type[data_type.PBA!=i]
energydata['Building Type']=data_type.PBA
index=['ELBTU','NGBTU','ELVNBTU','NGHTBTU'] # Annual Electricity Consumption, Annual Natural Gas consumption, Electricity Ventilation and Natural Gas Heating
for i in index:
    energydata[i]=data_type[i]/data_type.SQFT # Normalized all the data samples.

Before cluster analysis, any row has a zero value is wiped out. And the sample number is counted.


In [10]:
energydata = energydata.dropna(how='any')
energydata = energydata[~(energydata == 0).any(axis=1)]
PBA1=energydata['Building Type'].unique()
count=[]
for i in PBA1:
    count.append([energydata[energydata['Building Type']==i].shape[0],i])
count


Out[10]:
[[714, 2], [283, 16]]

The energy consumption pattern is plotted in box graph to help connect the cluster assignment to the building type information.


In [11]:
fig1 = plt.figure(figsize=(20,15))
times=1
data_seperate=[]
title=['Office', 'Inpatient Health Care']
# energydata[energydata['Building Type']==type_B[1]]
for i in range(len(type_B)):
    x=energydata[energydata['Building Type']==type_B[i]]
    x=x.drop(x.columns[0],axis=1)
    data_seperate.append(x) 
for i in range(len(type_B)):
    plt.subplot(len(type_B),2,times)
    data_seperate[i].boxplot()
    times=times+1
    plt.title(title[i],fontsize=20)
    plt.ylim(0,400)


C:\Users\thong\Anaconda3\lib\site-packages\ipykernel\__main__.py:12: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.

Cluster Analysis


In [15]:
y=pd.DataFrame()
for i in range(len(type_B)):
    y=y.append(data_seperate[i])
X=y.as_matrix().astype(np.float32)
from sklearn.cluster import KMeans
num_clust = 2
clusters = KMeans(n_clusters=num_clust).fit(X)
cluster_assignments = clusters.predict(X)
fig2 = plt.figure(figsize=(20,15))
for cluster_id in range(len(clusters.cluster_centers_)):
    plt.subplot(num_clust,2,cluster_id+1)
    cluster_members = X[cluster_assignments==cluster_id,:]
    print(len(cluster_members))
    for i in range(len(cluster_members)):
        plt.plot(cluster_members[i,:], color='grey', lw='0.1')
    plt.plot(clusters.cluster_centers_[cluster_id,:], color='k', lw='1')


723
274

Connect the cluster assignment to the building type and count the correctly assigned data samples.


In [16]:
y['assignment']=cluster_assignments
y=y.join(data['PBA'],how='inner')
y['judge']=1
y['judge'].iloc[np.where(np.array(y.PBA)==2)]=0
y['judge'].iloc[np.where(np.array(y.PBA)==16)]=1
y[y['judge']==y['assignment']].count()


C:\Users\thong\Anaconda3\lib\site-packages\pandas\core\indexing.py:132: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
Out[16]:
ELBTU         860
NGBTU         860
ELVNBTU       860
NGHTBTU       860
assignment    860
PBA           860
judge         860
dtype: int64

In [33]:
a=860/(723+274)

In [34]:
a


Out[34]:
0.8625877632898696

In [ ]: